
In this tutorial we will discuss how you can apply the linear regression and k-Nearest Neighbours methods in Python. We will also do some exploratory data analysis and use TPOT, a useful automated machine learning tool.
Ames housing dataset
Some exploratory data analysis
Preparing the data for machine learning
Linear Regression
k-Nearest Neighbours
Model Evaluation
This notebook relies on the following imports and settings. We load other libraries in context to make it clear what we are using them for.
# Packages
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore') # suppresses warnings in the final version; usually we should leave them on
# Plot settings
sns.set_context('notebook') # optimises figures for notebook display
sns.set_style('ticks') # set default plot style
colours = ['#4E79A7','#F28E2C','#E15759','#76B7B2','#59A14F',
'#EDC949','#AF7AA1','#FF9DA7','#9C755F','#BAB0AB']
sns.set_palette(colours) # set custom color scheme
%matplotlib inline
plt.rcParams['figure.figsize'] = (9, 6)
We use the Ames Housing dataset from De Cock (2011), which contains data about residential property sales in a North American city.
The dataset has 81 predictor variables of all standard types (continuous, discrete, nominal, and ordinal). Check the documentation for a description of the dataset.
The objective is to predict the sale price (the last column in the dataset). Our metric for evaluating performance will be the root mean squared error on the log scale. That implies that we care about the percentage errors in the predictions (approximately), rather than the errors measured in dollars.
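To make the metric concrete, here is a minimal sketch (with made-up prices; the `rmse_log` helper is illustrative, not part of the tutorial code) of the root mean squared error on the log scale, and why a small value reads approximately as an average percentage error.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """Root mean squared error computed on log-transformed values."""
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

# Hypothetical sale prices and predictions (in dollars)
y_true = np.array([200_000, 150_000, 300_000])
y_pred = np.array([220_000, 140_000, 310_000])

error = rmse_log(y_true, y_pred)

# For small errors, exp(rmse) - 1 is roughly the typical relative
# (percentage) error of the predictions.
approx_pct = np.exp(error) - 1
```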
import pandas as pd
# The following assumes that the dataset is in a subfolder called Data
data = pd.read_csv('Data/AmesHousing.csv')
data.head()
| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 20 | RL | 93.0 | 11160 | Pave | NaN | Reg | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 189900 |
5 rows × 80 columns
To keep things simpler this time, we only consider two predictors: ground living area and the number of garage spots. Next week, we'll study how to work with many predictors of all types.
variables = ['GrLivArea', 'GarageCars', 'SalePrice']
data = data.loc[:, variables]
data.head()
| | GrLivArea | GarageCars | SalePrice |
|---|---|---|---|
| 0 | 1656 | 2.0 | 215000 |
| 1 | 896 | 1.0 | 105000 |
| 2 | 1329 | 1.0 | 172000 |
| 3 | 2110 | 2.0 | 244000 |
| 4 | 1629 | 2.0 | 189900 |
Exploratory data analysis (EDA) is about understanding our data. Some of the goals of EDA are to discover useful information for building better machine learning models, identify potential problems to be addressed, and uncover interesting insights.
We'll talk more about EDA next week, but let's start doing it now.
Today we'll be using the dataprep package to assist us with EDA. Click on the ! symbol that appears on some of the plots and it will show you some insights about the data.
The distribution of sale prices is very right skewed, with some outliers.
from dataprep.eda import plot
plot(data, 'SalePrice')
| Distinct Count | 1032 |
|---|---|
| Unique (%) | 35.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 45.8 KB |
| Mean | 180796.0601 |
| Minimum | 12789 |
| Maximum | 755000 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 12789 |
|---|---|
| 5-th Percentile | 87500 |
| Q1 | 129500 |
| Median | 160000 |
| Q3 | 213500 |
| 95-th Percentile | 335000 |
| Maximum | 755000 |
| Range | 742211 |
| IQR | 84000 |
| Mean | 180796.0601 |
|---|---|
| Standard Deviation | 79886.6924 |
| Variance | 6.3819e+09 |
| Sum | 5.2973e+08 |
| Skewness | 1.7426 |
| Kurtosis | 5.1081 |
| Coefficient of Variation | 0.4419 |
We can apply a log transformation to make a right-skewed variable closer to symmetrically distributed. Compare the two KDE and normal Q-Q plots.
import numpy as np
data['LogSalePrice'] = np.log(data['SalePrice'])
plot(data, 'LogSalePrice')
| Distinct Count | 1032 |
|---|---|
| Unique (%) | 35.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 45.8 KB |
| Mean | 12.021 |
| Minimum | 9.4563 |
| Maximum | 13.5345 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 9.4563 |
|---|---|
| 5-th Percentile | 11.3794 |
| Q1 | 11.7714 |
| Median | 11.9829 |
| Q3 | 12.2714 |
| 95-th Percentile | 12.7219 |
| Maximum | 13.5345 |
| Range | 4.0781 |
| IQR | 0.5 |
| Mean | 12.021 |
|---|---|
| Standard Deviation | 0.4076 |
| Variance | 0.1661 |
| Sum | 35221.4383 |
| Skewness | -0.01479 |
| Kurtosis | 1.5093 |
| Coefficient of Variation | 0.03391 |
The ground living area is also very right-skewed. We'll leave this variable as is today, but we'll see later that it can be useful to transform skewed input variables too depending on the model to be trained.
plot(data, 'GrLivArea')
| Distinct Count | 1292 |
|---|---|
| Unique (%) | 44.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 45.8 KB |
| Mean | 1499.6904 |
| Minimum | 334 |
| Maximum | 5642 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 334 |
|---|---|
| 5-th Percentile | 861 |
| Q1 | 1126 |
| Median | 1442 |
| Q3 | 1742.75 |
| 95-th Percentile | 2463.1 |
| Maximum | 5642 |
| Range | 5308 |
| IQR | 616.75 |
| Mean | 1499.6904 |
|---|---|
| Standard Deviation | 505.5089 |
| Variance | 255539.2353 |
| Sum | 4.3941e+06 |
| Skewness | 1.2735 |
| Kurtosis | 4.1287 |
| Coefficient of Variation | 0.3371 |
The GarageCars column has a missing value. Based on studying the full dataset, this missing value occurs for a house that does not have a garage. Therefore, we replace it with a zero below.
plot(data, 'GarageCars')
| Distinct Count | 6 |
|---|---|
| Unique (%) | 0.2% |
| Missing | 1 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 45.8 KB |
| Mean | 1.7668 |
| Minimum | 0 |
| Maximum | 5 |
| Zeros | 157 |
| Zeros (%) | 5.4% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 1 |
| Median | 2 |
| Q3 | 2 |
| 95-th Percentile | 3 |
| Maximum | 5 |
| Range | 5 |
| IQR | 1 |
| Mean | 1.7668 |
|---|---|
| Standard Deviation | 0.7606 |
| Variance | 0.5785 |
| Sum | 5175 |
| Skewness | -0.2197 |
| Kurtosis | 0.2425 |
| Coefficient of Variation | 0.4305 |
data['GarageCars'] = data['GarageCars'].fillna(0)
We now examine the pairwise relationships between the response variable and the predictors.
Plotting the sale price against the ground living area reveals a funnel-shaped pattern of the type that we discussed in the lecture, indicating non-constant error variance. This tends to occur together with non-linearity and skewed errors.
We also observe pronounced outliers in the plot, which is a cause for concern because they can have a disproportionate effect on our trained models.
fig, ax = plt.subplots()
plt.scatter(data['GrLivArea'], data['SalePrice'], s=25) # the s option specifies the size of the dot
ax.set_xlabel('Ground living area')
ax.set_ylabel('Sale price')
sns.despine()
plt.show()
As discussed in the lecture, a log transformation of the response can mitigate issues of non-constant error variance, non-linearity and skewed errors, though we're still left with some outliers in this case.
fig, ax = plt.subplots()
plt.scatter(data['GrLivArea'], data['LogSalePrice'], s=25) # the s option specifies the size of the dot
ax.set_xlabel('Ground living area')
ax.set_ylabel('Log sale price')
sns.despine()
plt.show()
The next plot shows the relationship between the log sale price and the number of garage spots. It suggests a moderately strong association. Perhaps surprisingly, the Pearson correlation coefficient for the number of garage spots (0.675) is almost as high as that for the ground living area (0.696).
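As a sketch of how such correlations can be computed (on a toy DataFrame here, since the point is the API rather than the values), `DataFrame.corr` returns the pairwise Pearson correlation matrix by default:

```python
import pandas as pd

# Toy data standing in for the real predictors and response
toy = pd.DataFrame({
    'GrLivArea':    [900, 1200, 1500, 1800, 2400],
    'GarageCars':   [1, 1, 2, 2, 3],
    'LogSalePrice': [11.5, 11.8, 12.0, 12.2, 12.6],
})

corr = toy.corr()  # Pearson correlations by default
# corr.loc['GrLivArea', 'LogSalePrice'] gives the correlation between
# ground living area and the log sale price.
```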
sns.boxplot(x = data['GarageCars'], y = data['LogSalePrice'], boxprops=dict(alpha=.9))
sns.despine()
plt.show()
Note that there are very few houses with more than three garage spots.
data['GarageCars'].value_counts().sort_index()
0.0     158
1.0     778
2.0    1603
3.0     374
4.0      16
5.0       1
Name: GarageCars, dtype: int64
To be able to compare the predictive performance of different methods, we randomly split the data into training and validation sets. We'll use the training set to fit the models and predict the observations in the validation set.
We use the Scikit-Learn train_test_split function to split the data.
Below, we specify that the training set will contain 70% of the data. The random_state is an arbitrary number. If we run the analysis again with the same random_state we will get the same training and validation sets, even though the data split is random. This is important because we should always be able to reproduce our work.
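The reproducibility point can be checked directly. The sketch below uses dummy indices (a stand-in for `data.index`, since the point is the seed rather than the housing data): splitting twice with the same `random_state` yields identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(100)  # stand-in for data.index

train_a, valid_a = train_test_split(indices, train_size=0.7, random_state=42)
train_b, valid_b = train_test_split(indices, train_size=0.7, random_state=42)

# Identical seeds produce identical (but still randomly chosen) splits
same = np.array_equal(train_a, train_b) and np.array_equal(valid_a, valid_b)
```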
from sklearn.model_selection import train_test_split
# I like to work with the indices
index_train, index_valid = train_test_split(data.index, train_size=0.7, random_state=42)
# Write training and test sets
train = data.loc[index_train, :].copy() # copy() does not make a difference here but it's better to be explicit
valid = data.loc[index_valid, :].copy()
train.head()
| | GrLivArea | GarageCars | SalePrice | LogSalePrice |
|---|---|---|---|---|
| 2210 | 2002 | 3.0 | 145000 | 11.884489 |
| 782 | 1473 | 1.0 | 143000 | 11.870600 |
| 2310 | 1525 | 2.0 | 183500 | 12.119970 |
| 299 | 1191 | 2.0 | 162500 | 11.998433 |
| 2423 | 1414 | 2.0 | 178740 | 12.093688 |
Next, we build the design matrices and response vectors for the training and validation sets. Because the k-Nearest Neighbours method will require all predictors to be on the same scale, we use the StandardScaler from scikit-learn to standardise the predictors. Each resulting column has a sample mean of zero and a standard deviation of one on the training set.
predictors = ['GrLivArea', 'GarageCars']
response = 'LogSalePrice'
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(train[predictors])
X_valid = scaler.transform(valid[predictors])
y_train = np.ravel(train[response])
y_valid = np.ravel(valid[response])
Scikit-learn allows us to train and use a wide range of machine learning algorithms through a simple, consistent API.
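The pattern is the same for nearly every estimator: instantiate, `fit` on training data, then `predict`. A minimal sketch on toy data (the inputs below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data following y = 2x exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = LinearRegression()                # 1. instantiate with any settings
model.fit(X, y)                           # 2. learn the parameters from data
preds = model.predict(np.array([[5.0]]))  # 3. predict new cases
```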
We use the LinearRegression class from Scikit-Learn to fit a linear regression.
from sklearn.linear_model import LinearRegression
ols = LinearRegression()
ols.fit(X_train, y_train)
LinearRegression()
To compute predictions, we simply do the following:
y_pred = ols.predict(X_valid)
Residual diagnostics help us check whether our model is appropriate for the data and find ways to improve it.
We compute the residuals as follows.
y_fitted = ols.predict(X_train)
resid = y_train - y_fitted
First, we investigate the distribution of the residuals. As we might have expected based on the EDA, the model is subject to a few outliers. The outliers and perhaps other factors make the distribution of the residuals very left-skewed, which is not ideal. We'll discuss some tools to handle outliers next week.
# Dataprep expects a pandas dataframe
plot(pd.DataFrame(resid, columns=['Residuals']), 'Residuals')
| Distinct Count | 2038 |
|---|---|
| Unique (%) | 99.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 32.0 KB |
| Mean | 3.3345e-16 |
| Minimum | -2.3654 |
| Maximum | 0.8689 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 924 |
| Negatives (%) | 45.1% |
| Minimum | -2.3654 |
|---|---|
| 5-th Percentile | -0.4417 |
| Q1 | -0.1022 |
| Median | 0.02233 |
| Q3 | 0.1484 |
| 95-th Percentile | 0.3517 |
| Maximum | 0.8689 |
| Range | 3.2343 |
| IQR | 0.2506 |
| Mean | 3.3345e-16 |
|---|---|
| Standard Deviation | 0.2502 |
| Variance | 0.06258 |
| Sum | 6.839e-13 |
| Skewness | -1.635 |
| Kurtosis | 9.7745 |
| Coefficient of Variation | 7.5022e+14 |
Plotting the fitted values against the residuals suggests no major additional issues besides the outliers.
# tip: we usually create functions to automate plots that we use often
fig, ax = plt.subplots()
sns.regplot(x=y_fitted, y=resid, ci=None, lowess=True, scatter_kws={'s': 25, 'color': colours[2]},
            line_kws={'color': 'black', 'alpha': 0.5})
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
ax.set_title('Fitted values against residuals', fontsize=16)
sns.despine()
plt.show()
There may be some weak nonlinearity in the relationship between ground living area and the residuals. It's probably best to transform this predictor as we'll do next week.
fig, ax = plt.subplots()
sns.regplot(x=train['GrLivArea'], y=resid, ci=None, lowess=True, scatter_kws={'s': 25, 'color': colours[2]},
            line_kws={'color': 'black', 'alpha': 0.5})
ax.set_xlabel('Ground Living Area')
ax.set_ylabel('Residuals')
ax.set_title('Ground living area against residuals', fontsize=16)
sns.despine()
plt.show()
We instantiate the kNN method based on the Euclidean distance as follows. We try two different configurations for the number of neighbours.
from sklearn.neighbors import KNeighborsRegressor
knn1 = KNeighborsRegressor(n_neighbors = 5)
knn1.fit(X_train, y_train)
knn2 = KNeighborsRegressor(n_neighbors = 15)
knn2.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=15)
Let's also try the kNN method with the Mahalanobis distance. First, we compute the covariance matrix for the predictors.
Sigma = np.cov(X_train, rowvar=False)
We then instantiate the model as follows.
knn3 = KNeighborsRegressor(n_neighbors = 15, metric='mahalanobis',
metric_params={'V': Sigma})
knn3.fit(X_train, y_train)
KNeighborsRegressor(metric='mahalanobis',
metric_params={'V': array([[1.0004878 , 0.47195312],
[0.47195312, 1.0004878 ]])},
n_neighbors=15)
Later in the unit we'll discuss how you can systematically select the number of neighbours.
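One standard approach, sketched here on synthetic data since the tutorial defers the topic, is cross-validation over a grid of `n_neighbors` values with `GridSearchCV`:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem standing in for the housing data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Cross-validate each candidate number of neighbours
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={'n_neighbors': [5, 10, 15, 25]},
    scoring='neg_mean_squared_error',
    cv=5,
)
search.fit(X, y)
best_k = search.best_params_['n_neighbors']
```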
Automated Machine Learning (AutoML) is an area of machine learning that seeks to automate one or multiple steps in the development of machine learning pipelines, for example data preparation and model selection.
The promise of AutoML is to allow even non-experts to build useful machine learning models with minimal effort. AutoML is a very active area of research in machine learning, and there has been progress in recent years.
Because the demand for machine learning experts outstrips supply, there are strong incentives for the development of AutoML frameworks. Well-known commercial products include DataRobot AutoML, Google Cloud AutoML, and H2O Driverless AI.
Fortunately, there are also excellent open-source AutoML tools that can increase your productivity wherever you are in your machine learning journey.
Today we'll use TPOT, which uses an optimisation technique called genetic programming to search over scikit-learn pipelines.
%%time
from tpot import TPOTRegressor
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42, n_jobs=2)
tpot.fit(X_train, y_train)
Generation 1 - Current best internal CV score: -0.05739480387550144
Generation 2 - Current best internal CV score: -0.05739480387550144
Generation 3 - Current best internal CV score: -0.054920333601546203
Generation 4 - Current best internal CV score: -0.054920333601546203
Generation 5 - Current best internal CV score: -0.054920333601546203
Best pipeline: ElasticNetCV(RBFSampler(input_matrix, gamma=0.9), l1_ratio=1.0, tol=1e-05)
Wall time: 2min 16s
TPOTRegressor(generations=5, n_jobs=2, population_size=50, random_state=42,
verbosity=2)
The higher the number of generations and the population size, the better TPOT will tend to work. However, it will take longer to run. The settings above are well below the default of 100 so that the cell doesn't take too long to run.
Finally, we compare all the models on the validation set. The model found by TPOT has the highest predictive accuracy. The variations of KNN with 15 neighbours are not far behind.
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Initialise table
columns=['RMSE', 'R-Squared', 'MAE']
rows=['Linear Regression', 'KNN (k = 5)', 'KNN (k = 15)', 'KNN (Mahalanobis, k = 15)', 'TPOT']
results = pd.DataFrame(0.0, columns=columns, index=rows)
# List of models
models = [ols, knn1, knn2, knn3, tpot]
# Compute predictions and metrics
for i, model in enumerate(models):
y_pred = model.predict(X_valid)
results.iloc[i, 0] = np.sqrt(mean_squared_error(y_valid, y_pred))
results.iloc[i, 1] = r2_score(y_valid, y_pred)
results.iloc[i, 2] = mean_absolute_error(y_valid, y_pred)
results.round(3)
| | RMSE | R-Squared | MAE |
|---|---|---|---|
| Linear Regression | 0.241 | 0.666 | 0.177 |
| KNN (k = 5) | 0.248 | 0.647 | 0.188 |
| KNN (k = 15) | 0.231 | 0.692 | 0.173 |
| KNN (Mahalanobis, k = 15) | 0.232 | 0.692 | 0.173 |
| TPOT | 0.225 | 0.709 | 0.168 |